EnrichGT Documents

Author

Zhiming Ye

Published

December 18, 2024

Overview

All EnrichGT functions start with “egt_” or “database_”.

graph LR
    subgraph Enrichment Analysis
        A[egt_enrichment_analysis]
        B[egt_gsea_analysis]
    end

    subgraph Pathway Databases
        D[database_* funcs]
    end

    subgraph Visualize results
        P1[egt_plot_results]
        P2[egt_plot_umap]
    end

    subgraph egt_recluster_analysis
        K1[Pretty table]
        CC[cluster modules]
        MG[gene modules]
    end

    subgraph Pathway Act. and TF infer 
        
        I[egt_infer]
    end

    D --> A
    D --> B

    A --> C[Enriched Result]
    B --> C

    C --> CC
    C --> MG

    C --> P1

    CC --> K1
    MG --> K1

    CC --> P1
    CC --> P2

    MG --> I

Install EnrichGT

install.packages("pak")
pak::pkg_install("ZhimingYe/EnrichGT")

or

install.packages("devtools")
library(devtools)
install_github("ZhimingYe/EnrichGT")

AnnotationDbi, fgsea, reactome.db and GO.db come from Bioconductor and may be slow to install. If installation fails, check your network connection, update R and Bioconductor, or, on an older R version, install through Posit Package Manager.
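If the Bioconductor dependencies are the sticking point, it can help to check which of them are already present before retrying. The following is a minimal sketch and not part of EnrichGT. It assumes that BiocManager (the standard Bioconductor installer) is acceptable in your environment, and the install call is left commented out so that nothing is downloaded unintentionally.

```r
# Bioconductor packages named in the note above
bioc_deps <- c("AnnotationDbi", "fgsea", "reactome.db", "GO.db")

# TRUE for each package that can already be loaded
installed <- vapply(bioc_deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- bioc_deps[!installed]

if (length(missing_deps) > 0) {
  message("Missing Bioconductor packages: ", paste(missing_deps, collapse = ", "))
  # Uncomment to install them through BiocManager:
  # install.packages("BiocManager")
  # BiocManager::install(missing_deps)
} else {
  message("All Bioconductor dependencies are installed.")
}
```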

Meet EnrichGT

Important

See the package function reference pages for further information (for example, how to use a function).

Once the package is installed, you can also type ? followed by a function name to get help:

?egt_recluster_analysis

Core Function

Enrichment of genes

This is a C++-accelerated over-representation analysis (ORA) tool. All you need is a list of your favourite gene symbols. Once they are ready, load a database and run it!

Compared to the most popular clusterProfiler, the functions of EnrichGT differ slightly. This is mainly to accommodate wet lab researchers. First, most beginners are confused by the default input of clusterProfiler, which is “ENTREZ ID.” Most people familiar with biology are used to Gene Symbols, and even Ensembl IDs are not widely known, let alone a series of seemingly random numbers. Therefore, EnrichGT uses Gene Symbol as the default input, seamlessly integrating with most downstream results from various companies, making it more suitable for non-experts in the lab.

Second, clusterProfiler outputs an S4 object, which may be too complex for beginners (this is no joke); whereas EnrichGT outputs a simple table. The time of non-experts is precious, so I made these two adjustments. The only downside is that the GSEA peak plot is difficult to generate, but in reality, we focus more on NES and p-values, and in this case, bar plots are more convincing.

Furthermore, the pre-processing step of the hypergeometric test in EnrichGT’s ORA function (which determines the overlap between the input genes and each gene set) is accelerated using hash tables in C++, making it over five times faster than clusterProfiler::enricher(), which is a pure R implementation.

res <- egt_enrichment_analysis(genes = DEGtable$Genes,
database = database_GO_BP())

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_GO_ALL())

res <- egt_enrichment_analysis(genes = c("TP53","CD169","CD68","CD163",
                                         "You can add more genes"),
database = database_from_gmt("MsigDB_Hallmark.gmt"))
library(dplyr)
library(tibble)
library(org.Hs.eg.db)
library(gt)
library(testthat)
library(withr)
library(EnrichGT)
library(readr)
DEGexample <- read_csv("./DEG.csv")
New names:
• `` -> `...1`
Rows: 15903 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ...1
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DEGexample_UpReg <- DEGexample |> dplyr::filter(pvalue<0.05,log2FoldChange>0.7)
ora_result <- egt_enrichment_analysis(genes = DEGexample_UpReg$...1,database = database_GO_BP(org.Hs.eg.db))

✔ success loaded database, time used : 17.0558180809021
head(ora_result)
          ID                                        Description GeneRatio
1 GO:0035249               synaptic transmission, glutamatergic    19/457
2 GO:0051966 regulation of synaptic transmission, glutamatergic    16/457
3 GO:0050804       modulation of chemical synaptic transmission    39/457
4 GO:0099177             regulation of trans-synaptic signaling    39/457
5 GO:0050808                               synapse organization    36/457
6 GO:0048168         regulation of neuronal synaptic plasticity    12/457
    BgRatio       pvalue     p.adjust       qvalue
1 111/18870 2.055491e-11 8.053235e-08 8.409591e-05
2  79/18870 5.742011e-11 8.053235e-08 8.409591e-05
3 489/18870 7.257625e-11 8.053235e-08 8.409591e-05
4 490/18870 7.711980e-11 8.053235e-08 8.409591e-05
5 483/18870 2.512088e-09 2.098598e-06 1.109159e-03
6  56/18870 7.497368e-09 4.183196e-06 1.109159e-03
                                                                                                                                                                                                                                  geneID
1                                                                                                                ATP1A2/GRIA4/GRID2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/DGKI/NRXN1/NLGN1/UNC13A/MAPK8IP2/CACNG5/UNC13C
2                                                                                                                                   ATP1A2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/DGKI/NRXN1/NLGN1/UNC13A/MAPK8IP2/CACNG5
3 ACHE/APOE/ATP1A2/CA2/CAMK2B/CDC20/GFAP/GRIA4/GRID2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/HRAS/MAP1B/NTRK2/SLC6A1/CNTN2/VGF/WNT5A/INA/DGKI/DLGAP1/NRXN1/RIMS3/NMU/NLGN1/UNC13A/MAPK8IP2/ERC2/CACNG5/LRFN2/UNC13C/SHISA9
4 ACHE/APOE/ATP1A2/CA2/CAMK2B/CDC20/GFAP/GRIA4/GRID2/GRIK2/GRIK3/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM1/GRM5/GRM8/HRAS/MAP1B/NTRK2/SLC6A1/CNTN2/VGF/WNT5A/INA/DGKI/DLGAP1/NRXN1/RIMS3/NMU/NLGN1/UNC13A/MAPK8IP2/ERC2/CACNG5/LRFN2/UNC13C/SHISA9
5             ACHE/APOE/KIF1A/CAMK2B/CDC20/CDH6/CTNNA2/DSCAM/GAP43/GRID2/GRIN2B/GRM5/MAP1B/NRCAM/NTRK2/RAC3/SIX1/SLC6A1/CNTN2/WNT5A/INA/NRXN1/NLGN1/UNC13A/ERC2/IL1RAPL2/SEZ6L2/TREM2/LRFN2/IGSF9/BCAN/SYNDIG1/DNER/ADGRF1/LHFPL4/UNC13C
6                                                                                                                                                                APOE/CAMK2B/GRIK2/GRIN1/GRIN2A/GRIN2B/GRIN2D/GRM5/HRAS/CNTN2/VGF/SHISA9
  Count
1    19
2    16
3    39
4    39
5    36
6    12
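
For readers curious how the numbers in this table relate to each other, each p-value in an over-representation analysis comes from a hypergeometric test on the overlap counts. The sketch below uses base R only and a made-up toy database (the set names and the `universe` vector are hypothetical). It illustrates the statistic, not EnrichGT's C++ implementation, and the Benjamini-Hochberg correction is shown as a common choice rather than a statement about EnrichGT's defaults. The last line plugs in the counts from the first row of the table (19/457 and 111/18870).

```r
# Toy inputs: a gene universe, a query list, and two hypothetical gene sets
universe <- paste0("G", 1:1000)
query    <- paste0("G", 1:50)
toy_db   <- list(
  set_enriched = paste0("G", 1:40),     # entirely inside the query
  set_random   = paste0("G", 500:539)   # no overlap with the query
)

# One-sided hypergeometric test: P(overlap >= observed)
ora_pvalue <- function(query, gene_set, universe) {
  k <- length(intersect(query, gene_set))   # observed overlap
  m <- length(gene_set)                     # genes in the set
  N <- length(universe)                     # background size
  n <- length(query)                        # query size
  phyper(k - 1, m, N - m, n, lower.tail = FALSE)
}

pvals <- vapply(toy_db, function(s) ora_pvalue(query, s, universe), numeric(1))
padj  <- p.adjust(pvals, method = "BH")     # Benjamini-Hochberg correction

# Counts from the first row of the table: GeneRatio 19/457, BgRatio 111/18870
p_row1 <- phyper(19 - 1, 111, 18870 - 111, 457, lower.tail = FALSE)
```

The `k - 1` in the call is what turns the upper tail into "at least k overlapping genes" rather than "more than k".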
Have many sources of genes?

This function also supports several groups of genes at once: pass them as a named list.

# For many groups of genes
res <- egt_enrichment_analysis(list(Macrophages = c("CD169","CD68","CD163"),
                                    Fibroblast = c("COL1A2","COL1A3")), # you can add more named groups
                               database = database_from_gmt("panglaoDB.gmt"))
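
When the groups come from a marker table rather than being typed by hand, the named list can be built in one step. This is a minimal sketch, assuming a hypothetical data frame `markers` with `cluster` and `gene` columns (these names are illustrative and not required by EnrichGT):

```r
# Hypothetical marker table, e.g. exported from a single-cell workflow
markers <- data.frame(
  cluster = c("Macrophages", "Macrophages", "Macrophages", "Fibroblast", "Fibroblast"),
  gene    = c("CD68", "CD163", "MRC1", "COL1A1", "COL1A2"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one character vector of genes per cluster
gene_groups <- split(markers$gene, markers$cluster)
str(gene_groups)
```

The resulting `gene_groups` has the same shape as the list passed to egt_enrichment_analysis() above.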

Enrichment of weighted genes (GSEA)

Genes with specific weights (e.g. the log2FC) can be analysed with the GSEA method, which takes a pre-ranked gene set as input. It uses the C++-accelerated fgsea::fgsea() as its backend, so it is also very fast.

How to build pre-ranked gene set?

The genes_with_weights(genes, weights) function builds the pre-ranked gene set for GSEA analysis.

# From DEG analysis Results
res <- egt_gsea_analysis(genes = 
                           genes_with_weights(genes = DEG$genes, 
                                              weights = DEG$log2FoldChange),
                         database = database_GO_BP()
                         )

# From PCA
res <- egt_gsea_analysis(genes = genes_with_weights(genes = PCA_res$genes,
                                                    weights =PCA_res$PC1_loading),
                         database = database_from_gmt("MsigDB_Hallmark.gmt")
                         )
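
For orientation, a pre-ranked gene set is essentially a named numeric vector of weights, which is the form fgsea::fgsea() takes as its stats argument. The sketch below builds one in base R from a hypothetical DEG table. Whether genes_with_weights() performs exactly these cleaning steps is an assumption, so treat it as an illustration of the data structure rather than of the function's internals.

```r
# Hypothetical DEG table with one duplicated gene and one missing weight
deg <- data.frame(
  genes          = c("TP53", "CD68", "CD163", "GFAP", "APOE", "CD68"),
  log2FoldChange = c(-1.2, 2.5, 0.8, 3.1, NA, 2.4)
)

# Drop missing weights, then keep the first row for each gene
deg <- deg[!is.na(deg$log2FoldChange), ]
deg <- deg[!duplicated(deg$genes), ]

# Named numeric vector, sorted from most up- to most down-regulated
ranked <- sort(setNames(deg$log2FoldChange, deg$genes), decreasing = TRUE)
ranked
```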

HTML reports (gt table)

Because a raw result table is hard to read, EnrichGT can convert it into a pretty gt HTML table. This is only supported for re-enriched results.

The gt_object is a plain object from the gt package, so you can apply any gt function to it, for example:

re_enrichment_results@gt_object |> gt::gtsave("test.html") # Save it with the standard gt function gtsave()

For further usage of gt package, please refer to https://gt.rstudio.com/articles/gt.html.

See re-enrichment example for further demo.

Plotting functions

Warning

The Dot Plot supports both a simple enrichment result data.frame and a re-enriched egt_object, but the UMAP plot only supports a re-enriched egt_object.

The HTML gt table covers most needs, but not all of them. We do not want this package to become complex (you can simply draw your own figures from the enriched tables with ggplot2), but we still provide a limited set of plotting functions.
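
As a rough idea of what drawing it yourself involves, the sketch below builds a small dot plot from the first three rows of the ORA output shown earlier. It assumes ggplot2 is installed (the plot is skipped otherwise), the helper name `ratio_to_numeric` is made up for this example, and none of it reflects how egt_plot_results() is implemented.

```r
# First three rows of the ORA table shown earlier (values copied from the output)
ora <- data.frame(
  Description = c("synaptic transmission, glutamatergic",
                  "regulation of synaptic transmission, glutamatergic",
                  "modulation of chemical synaptic transmission"),
  GeneRatio   = c("19/457", "16/457", "39/457"),
  p.adjust    = c(8.053235e-08, 8.053235e-08, 8.053235e-08),
  Count       = c(19, 16, 39),
  stringsAsFactors = FALSE
)

# Convert "k/n" strings into numeric ratios
ratio_to_numeric <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}
ora$ratio <- ratio_to_numeric(ora$GeneRatio)

# Draw the plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(ora, aes(x = ratio, y = reorder(Description, ratio))) +
    geom_point(aes(size = Count, colour = -log10(p.adjust))) +
    labs(x = "Gene ratio", y = NULL, colour = "-log10(p.adjust)")
  print(p)
}
```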

Dot Plot

egt_plot_results(re_enrich)

UMAP Plot

egt_plot_umap(re_enrich)
Warning: ggrepel: 18 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Database Helpers

How to specify species?

EnrichGT uses AnnotationDbi for this. You can use org.Hs.eg.db for human and org.Mm.eg.db for mouse. For other species, please refer to Bioconductor.

For databases that do not come from AnnotationDbi, you do not need to provide this. For example, database_CollecTRI_human() returns a human-only database.

Built-in databases from AnnotationDbi

Pass the OrgDB argument to fetch them.

Example:

database_GO_BP(OrgDB = org.Hs.eg.db)

GO Database

database_GO_BP(), database_GO_CC(), database_GO_MF(), database_GO_ALL()

Reactome Database

database_Reactome()

Progeny Database

For pathway activity inference: database_progeny_human() and database_progeny_mouse()

CollecTRI Database

For transcription factor inference: database_CollecTRI_human() and database_CollecTRI_mouse()

Read Additional Gene Sets from Local Files

EnrichGT supports reading GMT files. You can obtain GMT files from MsigDB.

database_from_gmt("Path_to_your_Gmt_file.gmt")
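
For reference, a GMT file is plain tab-separated text: each line holds a gene set name, a description field, and then the member genes. The base-R sketch below writes a two-line toy file and parses it into a named list. The set names, the genes, and the `read_gmt_simple` helper are made up for illustration, and this is not the code database_from_gmt() uses. It only shows the file layout.

```r
# Write a toy GMT file: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "TOY_MACROPHAGE\tna\tCD68\tCD163\tMRC1",
  "TOY_FIBROBLAST\tna\tCOL1A1\tCOL1A2"
), gmt_path)

# Parse each line: field 1 is the set name, field 2 is skipped, the rest are genes
read_gmt_simple <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) f[-(1:2)])
  names(sets) <- vapply(fields, function(f) f[1], character(1))
  sets
}

gene_sets <- read_gmt_simple(gmt_path)
str(gene_sets)
```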

Where is KEGG?

KEGG restricts commercial use, and its data must be downloaded through the KEGG REST API. I have not had time to implement this yet. In the meantime, you can use the KEGG gene sets from MsigDB instead (KEGG_MED and KEGG_Classical).

Reading is slow?

Since version 0.5.0, EnrichGT has a cache system, so loading the same database a second time is much faster.

test <- database_GO_MF(org.Hs.eg.db)
✔ success loaded database, time used : 6.91346096992493
test_reload <- database_GO_MF(org.Hs.eg.db)
✔ Use cached database: GO_MF_org.Hs.eg.db
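
The idea behind such a cache can be pictured as a lookup keyed by database name. The sketch below is a generic memoization pattern in base R, with a hypothetical load_with_cache() helper and a stand-in loader. EnrichGT's actual cache may be organised differently.

```r
# Environment used as an in-memory key-value store
db_cache <- new.env(parent = emptyenv())
n_loads  <- 0   # counts how often the slow loader really runs

# Hypothetical helper: return the cached value, or run `loader` once and store it
load_with_cache <- function(key, loader) {
  if (exists(key, envir = db_cache, inherits = FALSE)) {
    return(get(key, envir = db_cache))
  }
  value <- loader()
  assign(key, value, envir = db_cache)
  value
}

# Stand-in for an expensive database build
slow_loader <- function() {
  n_loads <<- n_loads + 1
  list(TOY_SET = c("CD68", "CD163"))
}

first  <- load_with_cache("GO_MF_toy", slow_loader)   # runs the loader
second <- load_with_cache("GO_MF_toy", slow_loader)   # served from the cache
```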

Future development plan

Version 0.5 is frozen as of December 19, 2024.

Version 0.6 targets (work will start on 2024-12-25)

  • Support reading KEGG online

  • Add simple network plot like cnetplot

  • Add drawing function for egt_infer()

  • Better dot plot

  • Self-built gene converter

Acknowledgement

This package is inspired by the well-known clusterProfiler. Since version 0.5, the major enrichment functions of EnrichGT have been replaced by self-implemented functions, which provide a much more lightweight experience. Still, without clusterProfiler, I would not have tried to write this package.

Also using clusterProfiler?

Please cite:

T Wu#, E Hu#, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo*, G Yu*. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141. doi: 10.1016/j.xinn.2021.100141